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Abstract 

We analyze the performance of a class of manifold- learning algorithms that find their output 
by minimizing a quadratic form under some normalization constraints. This class consists 
of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment 
(LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions 
on the manifold that are necessary for the success of the algorithms. Both the finite sample 
case and the limit case are analyzed. We show that there are simple manifolds in which the 
necessary conditions are violated, and hence the algorithms cannot recover the underlying 
manifolds. Finally, we present numerical results that demonstrate our claims. 
Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion 
maps, locally linear embedding, local tangent space alignment, hessian eigenmap 



1. Introduction 



Many seemingly complex systems described by high-dimensional data sets are in fact gov- 
erned by a surprisingly low number of parameters. Revealing the low-dimensional repre- 
sentation of such high-dimensional data sets not only leads to a more compact description 
of the data, but also enhances our understanding of the system. Dimension-reducing algo- 
rithms attempt to simplify the system's representation without losing significant structural 
information. Various dimension-reduction algorithms were developed recently to perform 
embeddings for manifold-based data sets. These include the following algorithms: Locally 
Linear Embedding (LLE, Roweis and Saul, 2000), Isomap (Tenenbaum et al. 2000), Lapla- 



cian Eigenmaps (LEM, Belkin and Niyogi 



2003), Local Tangent Space Alignment (LTSA, 



©?? ??. 



Zhang and Zha 2004), Hessian Eigenmap (HLLE, Donoho and Grimes , [2004 1 , Semi-definite 



Embedding (SDE, Weinberger and Saul| 2006) and Diffusion Maps (DFM, Coifman and La- 



fon 2006). 



These manifold- learning algorithms compute an embedding for some given input. It 
is assumed that this input lies on a low-dimensional manifold, embedded in some high- 
dimensional space. Here a manifold is defined as a topological space that is locally equivalent 
to a Euclidean space. It is further assumed that the manifold is the image of a low- 
dimensional domain. In particular, the input points are the image of a sample taken from 
the domain. The goal of the manifold-learning algorithms is to recover the original domain 
structure, up to some scaling and rotation. The non-linearity of these algorithms allows 
them to reveal the domain structure even when the manifold is not linearly embedded. 

The central question that arises when considering the output of a manifold-learning 
algorithm is, whether the algorithm reveals the underlying low- dimensional structure of 
the manifold. The answer to this question is not simple. First, one should define what 
"revealing the underlying lower-dimensional description of the manifold" actually means. 
Ideally, one could measure the degree of similarity between the output and the original 
sample. However, the original low-dimensional data representation is usually unknown. 
Nevertheless, if the low-dimensional structure of the data is known in advance, one would 
expect it to be approximated by the dimension-reducing algorithm, at least up to some 
rotation, translation, and global scaling factor. Furthermore, it would be reasonable to 
expect the algorithm to succeed in recovering the original sample's structure asymptotically, 
namely, when the number of input points tends to infinity. Finally, one would hope that 
the algorithm would be robust in the presence of noise. 

Previous papers have addressed the central question posed earlier. Zhang and Zha ( 2004 ) 
presented some bounds on the local-neighborhoods' error-estimation for LTSA. However, 
their analysis says nothing about the global embedding. Huo and Smith (2006) proved that, 
asymptotically, LTSA recovers the original sample up to an affine transformation. They 
assume in their analysis that the level of noise tends to zero when the number of input points 
tends to infinity. Bernstein et al. (2000.) proved that, asymptotically, the embedding given 



by the Isomap algorithm (Tenenbaum et al. 2000) recovers the geodesic distances between 
points on the manifold. 

In this paper we develop theoretical results regarding the performance of a class of 
manifold-learning algorithms, which includes the following five algorithms: Locally Linear 
Embedding (LLE), Laplacian Eigenmap (LEM), Local Tangent Space Alignment (LTSA), 
Hessian Eigenmaps (HLLE), and Diffusion maps (DFM). 

We refer to this class of algorithms as the normalized-output algorithms. The normalized- 
output algorithms share a common scheme for recovering the domain structure of the input 
data set. This scheme is constructed in three steps. In the first step, the local neighbor- 
hood of each point is found. In the second step, a description of these neighborhoods is 
computed. In the third step, a low-dimensional output is computed by solving some convex 
optimization problem under some normalization constraints. A detailed description of the 
algorithms is given in Section [2] 

In Section [3] we discuss informally the criteria for determining the success of manifold- 
learning algorithms. We show that one should not expect the normalized-output algo- 
rithms to recover geodesic distances or local structures. A more reasonable criterion for 
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success is a high degree of similarity between the output of the algorithms and the original 
sample, up to some affine transformation; the definition of similarity will be discussed later. 
We demonstrate that under certain circumstances, this high degree of similarity does not 
occur. In Section [4] we find necessary conditions for the successful performance of LEM and 
DFM on the two-dimensional grid. This section serves as an explanatory introduction to 
the more general analysis that appears in Section |5j Some of the ideas that form the basis 
of the ana lysis in Section [4] were discussed independently by both Gerber et al. (2007) and 



ourselves (Goldberg et al. 2007). Section [5] finds necessary conditions for the successful per- 
formance of all the normalized-output algorithms on general two-dimensional manifolds. In 
Section [6] we discuss the performance of the algorithms in the asymptotic case. Concluding 
remarks appear in Section [7j The detailed proofs appear in the Appendix. 

Our paper has two main results. First, we give well-defined necessary conditions for 
the successful performance of the normalized-output algorithms. Second, we show that 
there exist simple manifolds that do not fulfill the necessary conditions for the success of 
the algorithms. For these manifolds, the normalized-output algorithms fail to generate 
output that recovers the structure of the original sample. We show that these results 
hold asymptotically for LEM and DFM. Moreover, when noise, even of small variance, is 
introduced, LLE, LTSA, and HLLE will fail asymptotically on some manifolds. Throughout 
the paper, we present numerical results that demonstrate our claims. 

2. Description of Output-normalized Algorithms 

In this section we describe in short the normalized-output algorithms. The presentation 
of these algorithms is not in the form presented by the respective authors. The form used 
in this paper emphasizes the similarities between the algorithms and is better-suited for 



further derivations. In Appendix A.l we show the equivalence of our representation of the 
algorithms and the representations that appear in the original papers. 

Let X = [xi, . . . , xjv]') x i £ J^ 15 be the input data where T> is the dimension of the 
ambient space and iV is the size of the sample. The normalized-output algorithms attempt 
to recover the underlying structure of the input data X in three steps. 

In the first step, the normalized-output algorithms assign neighbors to each input point 
Xi based on the Euclidean distances in the high-dimensional space^J This can be done, 
for example, by choosing all the input points in an r-ball around Xj or alternatively by 
choosing Xj's X-nearest-neighbors. The neighborhood of Xj is given by the matrix Xi = 
[xi, x^i, . . . , xi^k]' where Xjj : j = 1, . . . , K are the neighbors of Xj. Note that K = K(i) 
can be a function of i, the index of the neighborhood, yet we omit this index to simplify 
the notation. For each neighborhood, we define the radius of the neighborhood as 

r(i) = max llx; ,• — x» u\\ (1) 

where we define x^o = Xj. Finally, we assume throughout this paper that the neighborhood 
graph is connected. 



1. The neighborhoods are not mentioned explicitly by Coifman and Lafon (20061. However, since a sparse 



optimization problem is considered, it is assumed implicitly that neighborhoods are defined (see Sec. 2.7 
therein). 
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In the second step, the normalized-output algorithms compute a description of the local 
neighborhoods that were found in the previous step. The description of the i-th neighbor- 
hood is given by some weight matrix W%. The matrices W% for the different algorithms are 
presented. 



LEM and DFM: W* is a K x (K + 1) matrix, 



/ 1/2 
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For LEM Wij = 1 is a natural choice, yet it is also possible to define the weights as 
= e~^ Xi ~ Xi ^^ / £ , where e is the width parameter of the kernel. For the case of 
DFM, 

where k e is some rotation-invariant kernel, q e {xi) = Y2j ^e{ x ii x i,j) an d £ is again a 
width parameter. We will use a = 1 in the normalization of the diffusion kernel, yet 



other values of a can be considered (see details in Coifman and Lafon 2006 ). For both 
LEM and DFM, we define the matrix D to be a diagonal matrix where da = ^ ■ Wij. 

LLE: Wi is a 1 x [K + 1) matrix, 



Wi = [ 1 



The weights Wij are chosen so that Xi can be best linearly reconstructed from its 
neighbors. The weights minimize the reconstruction error function 



A 4 (w i; x, . . . ,w i>K ) = \\xi - y~]wijX it j\ 



(3) 



under the constraint £^ ■ Wij = 1. In the case where there is more than one solution 
that minimizes A*, regularization is applied to force a unique solution (for details, see 



Saul and Roweis, 2003) 



LTSA: Wi is a (K + 1) x (K + 1) matrix, 

W % = (I-PiP')H. 

Let UiLiV/ be the SVD of Xi — \x\ where Xi is the sample mean of Xi and 1 is a 



vector of ones (for details about SVD, see, for example, Golub and Loan 19831. Let 
Pi = [u(i), ■ ■ ■ , um\] be the matrix that holds the first d columns of U% where d is the 
output dimension. The matrix H = I — is the centering matrix. See also 



and Smith (2006) regarding this representation of the algorithm. 



Huo 
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where is a vector of zeros and H % is the — - x K Hessian estimator. 



HLLE: Wi is a d{d + l)/2 X (K + 1) matrix, 

Wi = (0,fT 

d(d+l) 
2 

The estimator can be calculated as follows. Let UiLiV/ be the SVD of X{ — lx^. Let 

Mi = [1, C/f } , . . . , C/f \ diag(C/( 1) ^ (1) '), diag(£/f >'), . . . , diag(C/f >')] , 

where the operator diag returns a column vector formed from the diagonal elements 
of the matrix. Let Mj be the result of the Gram-Schmidt orthonormalization on Mj. 
Then H l is defined as the transpose of the last d{d+ l)/2 columns of Mj. 

The third step of the normalized-output algorithms is to find a set of points Y = 
[yi, . . . , 2/jv]', Hi £ where d < T> is the dimension of the manifold. Y is found by minimiz- 
ing a convex function under some normalization constraints, as follows. Let Y be any N x d 
matrix. We define the i-th neighborhood matrix Yj = [yi, y^i, . . . , using the same pairs 
of indices i,j as in Xj. The cost function for all of the normalized-output algorithms is given 
by 

N N 

*(y) = X)W = EH w * y *llF. ( 4 ) 
i=i i=i 

under the normalization constraints 

{ Y'm -0 f ° r LEM and ° FM ' { ^'l^-O 1 for LLE, LTSA and HLLE, (5) 

where || \\ F stands for the Frobenius norm, and Wi is algorithm-dependent. 

Define the output matrix Y to be the matrix that achieves the minimum of $ under the 
normalization constraints of Eq. [5] (Y is defined up to rotation). Then we have the follow- 
ing: the embeddings of LEM and LLE are given by the according output matrices Y; the 
embeddings of LTSA and HLLE are given by the according output matrices ~^Y] and the 



embedding of DFM is given by a linear transformation of Y as discussed in Appendix |A.l 



The discussion of the algorithms' output in this paper holds for any affine transformation of 
the output (see Section [3]). Thus, without loss of generality, we prefer to discuss the output 
matrix Y directly, rather than the different embeddings. This allows a unified framework 
for all five normalized-output algorithms. 



3. Embedding quality 

In this section we discuss possible definitions of "successful performance" of manifold- 
learning algorithms. To open our discussion, we present a numerical example. We chose 
to work with LTSA rather arbitrarily. Similar results can be obtained using the other 
algorithms. 

The example we consider is a uniform sample from a two-dimensional strip, shown in 
Fig. [T]A. Note that in this example, T> = d; i.e., the input data is identical to the original 
data. Fig. [TJ3 presents the output of LTSA on the input in Fig. [T]^. The most obvious 
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Figure 1: The output of LTSA (B) for the (two-dimensional) input shown in (A), where the 
input is a uniform sample from the strip [0, 1] x [0, 6]. Ideally one would expect 
the two to be identical. The normalization constraint shortens the horizontal dis- 
tances and lengthens the vertical distances, leading to the distortion of geodesic 
distances. (E) and (F) focus on the points shown in black in (A) and (B), re- 
spectively. The blue triangles in (E) and (F) are the 8-nearest-neighborhood of 
the point denoted by the full black circle. The red triangles in (F) indicate the 
neighborhood computed for the corresponding point (full black circle) in the out- 
put space. Note that less than half of the original neighbors of the point remain 
neighbors in the output space. The input (A) with the addition of Gaussian noise 
normal to the manifold and of variance 10 -4 is shown in (C). The output of LTSA 
for the noisy input is shown in (D). (G) shows a closeup of the neighborhood of 
the point indicated by the black circle in (D). 



difference between input and output is that while the input is a strip, the output is roughly 
square. While this may seem to be of no importance, note that it means that the algorithm, 
like all the normalized-output algorithms, does not preserve geodesic distances even up to 
a scaling factor. By definition, the geodesic distance between two points on a manifold is 
the length of the shortest path on the manifold between the two points. Preservation of 
geodesic distances is particularly relevant when the manifold is isometrically embedded. In 
this case, assuming the domain is convex, the geodesic distance between any two points on 
the manifold is equal to the Euclidean distance between the corresponding domain points. 



Geodesic distances are conserved, for example, by the Isomap algorithm (Tenenbaum et al. 



2000). 



Figs. [T^ and[T^ present closeups of Figs. [T]^ and[T)3, respectively. Here, a less obvious 
phenomenon is revealed: the structure of the local neighborhood is not preserved by LTSA. 
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By local structure we refer to the angles and distances (at least up to a scale) between all 
points within each local neighborhood. Mappings that preserve local structures up to a scale 



are called conformal mappings (see for example de Silva and Tenenbaum 2003 Sha and 



Saul 20051. In addition to the distortion of angles and distances, the ET-nearest-neighbors 



of a given point on the manifold do not necessarily correspond to the i^-nearest-neighbors 
of the respective output point, as shown in Figs. [Ip and [Ip. Accordingly, we conclude 
that the original structure of the local neighborhoods is not necessarily preserved by the 
normalized-output algorithms. 

The above discussion highlights the fact that one cannot expect the normalized-output 
algorithms to preserve geodesic distances or local neighborhood structure. However, it 
seems reasonable to demand that the output of the normalized-output algorithms resemble 
an affine transformation of the original sample. In fact, the output presented in Fig. [Tj3 is 
an affine transformation of the input, which is the original sample, presented in Fig. [TjA.. 
A formal similarity criterion based on affine transformations is given by Huo and Smith| 
( )2006D . In the following, we will claim that a normalized-output algorithm succeeds (or 
fails) based on the existence (or lack thereof) of resemblance between the output and the 
original sample, up to an affine transformation. 

Fig. [Tp presents the output of LTSA on a noisy version of the input, shown in Fig. [Tp. In 
this case, the algorithm prefers an output that is roughly a one- dimensional curve embedded 
in M 2 . While this result may seem incidental, the results of all the other normalized-output 
algorithms for this example are essentially the same. 

Using the affine transformation criterion, we can state that LTSA succeeds in recovering 
the underlying structure of the strip shown in Fig. [T]A However, in the case of the noisy 
strip shown in Fig. [Tp, LTSA fails to recover the structure of the input. We note that all 
the other normalized-output algorithms perform similarly. 

For practical purposes, we will now generalize the definition of failure of the normalized- 
output algorithms. This definition is more useful when it is necessary to decide whether an 
algorithm has failed, without actually computing the output. This is useful, for example, 
when considering the outputs of an algorithm for a class of manifolds. 

We now present the generalized definition of failure of the algorithms. Let X — A/vxd 
be the original sample. Assume that the input is given by ip(X) C MP, where ip : R d ^ R v 
is some smooth function, and T> > d is the dimension of the input. Let Y = l^vxd be 
an affine transformation of the original sample X, such that the normalization constraints 
of Eq. [5] hold. Note that Y is algorithm-dependent, and that for each algorithm, Y is 
unique up to rotation and translation. When the algorithm succeeds it is expected that 
the output will be similar to a normalized version of X, namely to Y. Let Z = Z^ X d be 
any matrix that satisfies the same normalization constraints. We say that the algorithm 
has failed if &(Y) > &(Z), and Z is substantially different from Y, and hence also from 
X. In other words, we say that the algorithm has failed when a substantially different 
embedding Z has a lower cost than the most appropriate embedding Y. A precise definition 
of "substantially different" is not necessary for the purposes of this paper. It is enough to 
consider Z substantially different from Y when Z is of lower dimension than Y, as in Fig.JTp. 

We emphasize that the matrix Z is not necessarily similar to the output of the algorithm 
in question. It is a mathematical construction that shows when the output of the algorithm 
is not likely to be similar to Y, the normalized version of the true manifold structure. 
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The following lemma shows that if &(Y) > &(Z), the inequality is also true for a small 
perturbation of Y. Hence, it is not likely that an output that resembles Y will occur when 
<£(Y) > &(Z) and Z is substantially different from Y. 

Lemma 3.1 Let Y be an N x d matrix. Let Y = Y + eE be a perturbation ofY, where E 



is an N x d matrix such that \\E\\ 



1 and where e > 0. Let S be the maximum number 



of neighborhoods to which a single input point belongs. Then for LLE with positive weights 



w 



1,3 > 



LEM, DFM, LTSA, and HLLE, we have 



> (1 - 4s)$(Y) - 4eC a S , 
where C a is a constant that depends on the algorithm. 



The use of positive weights in LLE is discussed in Saul and Roweis (2003 Section 5); a 



similar result for LLE with general weights can be obtained if one allows a bound on the 
values of Wij- The proof of Lemma 3.1 is given in Appendix A. 2 



4. Analysis of the two-dimensional grid 

In this section we analyze the performance of LEM on the two-dimensional grid. In partic- 
ular, we argue that LEM cannot recover the structure of a two-dimensional grid in the case 
where the aspect ratio of the grid is greater than 2. Instead, LEM prefers a one-dimensional 



curve in M. . Implications also follow for DFM, as explained in Section 4.3 followed by a 
discussion of the other normalized-output algorithms. Finally, we present empirical results 
that demonstrate our claims. 

In Section[5]we prove a more general statement regarding any two-dimensional manifold. 
Necessary conditions for successful performance of the normalized-output algorithms on 
such manifolds are presented. However, the analysis in this section is important in itself for 
two reasons. First, the conditions for the success of LEM on the two-dimensional grid are 
more limiting. Second, the analysis is simpler and points out the reasons for the failure of 
all the normalized-output algorithms when the necessary conditions do not hold. 



4.1 Possible embeddings of a two-dimensional grid 

We consider the input data set X to be the two-dimensional grid [— m, . . . ,m] X [— q, . . . ,q], 
where m > q. We denote Xij = (i,j). For convenience, we regard X = (X^\ X^) as an 
N x 2 matrix, where N = (2m + l)(2q + 1) is the number of points in the grid. Note that 
in this specific case, the original sample and the input are the same. 

In the following we present two different embeddings, Y and Z. Embedding Y is the 
grid itself, normalized so that Cov(Y) = /. Embedding Z collapses each column to a point 
and positions the resulting points in the two-dimensional plane in a way that satisfies the 
constraint Cov(Z) = / (see Fig. [2] for both). The embedding Z is a curve in M 2 and clearly 
does not preserve the original structure of the grid. 

We first define the embeddings more formally. We start by defining Y = X(X'DX)^ 1 ^ 2 . 
Note that this is the only linear transformation of X (up to rotation) that satisfies the 
conditions Y'Dl = and Y'DY = L, which are the normalization constraints for LEM (see 
Eq. I5J). However, the embedding Y depends on the matrix D, which in turn depends on the 
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Figure 2: (A) The input grid. (B) Embedding Y, the normalized grid. (C) Embedding Z, 
a curve that satisfies Cov(Z) = /. 



choice of neighborhoods. Recall that the matrix D is a diagonal matrix, where da equals 
the number of neighbors of the i-th point. Choose r to be the radius of the neighborhoods. 
Then, for all inner points Xij, the number of neighbors K(i,j) is a constant, which we 
denote as K. We shall call all points with less than K neighbors boundary points. Note 
that the definition of boundary points depends on the choice of r. For inner points of the 
grid we have da = K. Thus, when K <C N we have X'DX ~ KX'X. 

We define Y = XCov(X)- 1 / 2 . Note that Y'l = 0, Cov(Y) = I and for K < N, 
Y ~ v 7 KNY. In this section we analyze the embedding Y instead of Y, thereby avoiding 
the dependence on the matrix D and hence simplifying the notation. This simplification 
does not significantly change the problem and does not affect the results we present. Similar 
results are obtained in the next section for general two-dimensional manifolds, using the 



exact normalization constraints (see Section 5.2). 

Note that Y can be described as the set of points [—m/a, . . . , m/cr] x [—q/r, . . . , q/r], 
where yij = (i/a,j/r). The constants a 2 = Var(X^ 1 ^) and r 2 = Var(X( 2 )) ensure that 
the normalization constraint Cov(Y) = / holds. Straightforward computation (see Ap- 



pendix A. 3) shows that 



a 



(m + l)m 



+ 1)9 



(6) 



The definition of the embedding Z is as follows: 



-2i 
P 



2i 
p 



?(2) 



(7) 



i > 



where = ^ 2q ^ 2 ensures that Z'l = 0, and a (the same a as before; see below) 

and p are chosen so that sample variance of and Z^ is equal to one. The symmetry of 
Z^ about the origin implies that Cov(Z^\ Z^) = 0, hence the normalization constraint 
Cov(Z) = I holds, a is as defined in Eq. pi since Z^ = (with both defined similarly 



to A^ 1 )). Finally, note that the definition of Zij does not depend on j. 
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4.2 Main result for LEM on the two-dimensional grid 

We estimate &(Y) by Ncj)(Yij) (see Eq. Q, where y^ is an inner point of the grid and Y{j is 
the neighborhood of y%j\ likewise, we estimate &{Z) by N<j>(Z%j) for an inner point Z{j. For 
all inner points, the value of (ft(Yij) is equal to some value <p. For boundary points, ^>iXij) 18 
bounded by <p multiplied by some constant that depends only on the number of neighbors. 
Hence, for large m and q, the difference between $(Y) and N(f>(Yij) is negligible. 
The main result of this section states: 

Theorem 4.1 Let yij be an inner point and let the ratio ^ be greater than 2. Then 

</>(Yij) > 4>{Zij) 

for neighborhood-radius r that satisfies 1 < r < 3, or similarly, for K-nearest neighborhoods 
where K = 4,8,12. 

This indicates that for aspect ratios ^ that are greater than 2 and above, mapping Z, which 
is essentially one-dimensional, is preferred to Y, which is a linear transformation of the grid. 
The case of general r-ball neighborhoods is discussed in Appendix A. 4 and indicates that 
similar results should be expected. 

The proof of the theorem is as follows. It can be shown analytically (see Fig. [3]) that 

mo) = F{K) (^2 + ^) > (8) 

where 

F(4) = 2; F(8) = 6; F(12) = 14 . (9) 

For higher K, F{K) can be approximated for any r-ball neighborhood of y^ (see Ap- 
pendix A. 4 ). 

It can be shown (see Fig. [3J that 

ftZij) = F(K) + , (10) 



where F{K) = F{K) for K = 4, 8, 12. For higher K, it can be shown (see Appendix A. 4) 
that F{K) ~ F{K) for any r-ball neighborhood. 

A careful computation (see Appendix A. 5) shows that 

p>a, (11) 

and therefore 

(j>{Zij) < ■ (12) 

Assume that — > 2. Since both m and q are integers, we have that m + 1 > 2{q + 1). 
Hence, using Eq. [6] we have 

2 = m(m + 1) 4q(q + 1) = 2 
° 3 3 T 

Combining this result with Eqs. [8] and [12] we have 



in 



> 2 4>{Yj) > (^(Zij) . 



which proves Theorem 4.1 
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Figure 3: (A) The normalized grid at an inner point y^j. The 4-nearest-neighbors of y™ are 
marked in blue. Note that the neighbors from the left and from the right are at 
a distance of 1/cr, while the neighbors from above and below are at a distance 
of 1/t. The value of 4>(Yij) is equal to the sum of squared distances of yij to 
its neighbors. Hence, we obtain that 4>{Yij) = 2 /a 2 + 2/r 2 when K = 4 and 
(^(Yij) = 2/a 2 + 2/r 2 + 4(l/cx 2 + 1/r 2 ) when K = 8. (B) The curve embedding at 
an inner point Zjj . The neighbors of z« from the left and from the right are marked 
in red. The neighbors from above and below are embedded to the same point as 
Zij. Note that the squared distance between Zij and £(j±i)j equals 1/cx 2 +4/p 2 . 
Hence, 0(Zy) = 2(l/cr 2 + 4/p 2 ) when K = 4, and 0(Z#) = 6(1/ct 2 + 4/p 2 ) when 
K = 8. 



4.3 Implications to other algorithms 



We start with implications regarding DFM. There are two main differences between LEM 
and DFM. The first difference is the choice of the kernel. LEM chooses w 



hi 



1, which 

can be referred to as the "window" kernel (a Gaussian weight function was also considered 
by Belkin and Niyogi 20031. DFM allows a more general rotation-invariant kernel, which 



includes the "window" kernel of LEM. The second difference is that DFM renormalizes the 
weights k £ (xi, Xij) (see Eq. [2]). However, for all the inner points of the grid with neighbors 
that are also inner points, the renormalization factor (q £ (xi)~ q £ 



is a constant. 



Therefore, if DFM chooses the "window" kernel, it is expected to fail, like LEM. In other 
words, when DFM using the "window" kernel is applied to a grid with aspect ratio slightly 
greater than 2 or above, DFM will prefer the embedding Z over the embedding Y (see Fig[2j. 
For a more general choice of kernel, the discussion in Appendix A. 4 indicates that a similar 
failure should occur. This is because the relation between the estimations of $(Y) and &(Z) 
presented in Eqs.[8]and 10 holds for any rotation-invariant kernel (see Appendix A.4). This 
observation is also evident in numerical examples, as shown in Figs. |4]and[5} 

In the cases of LLE with no regularization, LTSA, and HLLE, it can be shown that 
$(Y) = 0. Indeed, for LTSA and HLLE, the weight matrix Wi projects on a space that 

1 1 2 

is perpendicular to the SVD of the neighborhood Xi, thus HWjXjH^ = 0. Since Y{ = 
XiCov(X)- 1 / 2 , we have HW^H^ = 0, and, therefore, = 0. For the case of LLE 
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Figure 4: The output of LEM on a grid of dimensions 81 x 41 is presented in (A). The 
result of LEM for the grid of dimensions 81 x 39 is presented in (B). The number 
of neighbors in both computations is 8. The output for DFM on the same data 
sets using a = 2 appears in (C) and (D), respectively. 



with no regularization, when K > 3, each point can be reconstructed perfectly from its 
neighbors, and the result follows. Hence, a linear transformation of the original data should 
be the preferred output. However, the fact that $(Y) = relies heavily on the assumption 
that both the input X and the output Y are of the same dimension (see Theorem 5.1 for 
manifolds embedded in higher dimensions), which is typically not the case in dimension- 
reducing applications. 



4.4 Numerical results 

For the following numerical results, we used the Matlab implementation written by the 
respective algorithms' authors as provided by Wittman ( retrieved Jan. 2007| ) (a minor 
correction was applied to the code of HLLE) . 

We ran the LEM algorithm on data sets with aspect ratios above and below 2. We 
present results for both a grid and a uniformly sampled strip. The neighborhoods were 
chosen using isT-nearest neighbors with K = 4, 8, 16, and 64. We present the results for 
K = 8; the results for K = 4, 16, and 64 are similar. The results for the grid and the 
random sample are presented in Figs. [4] and [5] respectively. 

We ran the DFM algorithm on the same data sets. We used the normalization constant 
a = 1 and the kernel width a = 2; the results for a = 1,4, and 8 are similar. The results 
for the grid and the random sample are presented in Figures [4] and [5] respectively. 
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20 40 60 80 




20 40 60 80 



Figure 5: (A) and (D) show the same 3000 points, uniformly-sampled from the unit square, 
scaled to the areas [0,81] x [0,41] and [0,81] x [0,39], respectively. (B) and (E) 
show the outputs of LEM for inputs (A) and (D), respectively. The number of 
neighbors is both computations is 8. (C) and (F) show the output for DFM on 
the same data sets using a = 2. Note the sharp change in output structure for 
extremely similar inputs. 



Both examples clearly demonstrate that for aspect ratios sufficiently greater than 2, both 
LEM and DFM prefer a solution that collapses the input data to a nearly one-dimensional 
output, normalized in M 2 . This is exactly as expected, based on our theoretical arguments. 

Finally, we ran LLE, HLLE, and LTSA on the same data sets. In the case of the grid, 
both LLE and LTSA (roughly) recovered the grid shape for K = 4,8,16, and 64, while 
HLLE failed to produce any output due to large memory requirements. In the case of the 
random sample, both LLE and HLLE succeeded for K = 16, 64 but failed for K = 4,8. 
LTSA succeeded for K = 8, 16, and 64 but failed for K = 4. The reasons for the failure for 
lower values of K are not clear, but may be due to roundoff errors. In the case of LLE, the 
failure may also be related to the use of regularization in LLE's second step. 

5. Analysis for general two-dimensional manifolds 

The aim of this section is to present necessary conditions for the success of the normalized- 
output algorithms on general two-dimensional manifolds embedded in high-dimensional 
space. We show how this result can be further generalized to manifolds of higher dimension. 
We demonstrate the theoretical results using numerical examples. 
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5.1 Two different embeddings for a two-dimensional manifold 



We start with some definitions. Let A = 
Without loss of generality, we assume that 



[x 1 ,...,x N \ , Xi £ 



be the original sample. 



0; Cov(A) = S 







As in Section |4j we assume that a > r. Assume that the input for the normalized-output 
algorithms is given by ip(X) C R v where : R 2 ^ R v is a smooth function and T> > 2 
is the dimension of the input. When the mapping ip is an isometry, we expect ^(A) to be 
small. We now take a close look at ^(A). 



A' 



$(A) = \\WiX.. 



2 

t\\F 



N 

E 

i=l 



WiX. 



(1) 



N 



(2) 



WiX, 



and 



where X\ is the j-th column of the neighborhood Aj. Define e/ - 

(i) 

note that e\ depends on the different algorithms through the definition of the matrices 

(i) 

Wi. The quantity e- is the portion of error obtained by using the j-th column of the 



i-th neighborhood when using the original sample as output. Denote = ^ Yli e i 
average error originating from the j-th. column. 



the 



We define two different embeddings for ip(X), following the logic of Sec. 4.1 Let 

Y = A XT 1 / 2 (13) 



be the first embedding. Note that Y is just the original sample up to a linear transformation 
that ensures that the normalization constraints Cov(y) = / and Y'l = hold. Moreover, Y 
is the only transformation of A that satisfies these conditions, which are the normalization 



constraints for LLE, HLLE, and LTSA. In Section 5.2 we discuss the modified embeddings 
for LEM and DFM. 

The second embedding, Z, is given by 



?(2) 



? (2) 



4 1] < o 



x ( P > 



(14) 



Here 



E 



„(!) 



2\ 1/2 



E 

i:x[ 1] >0 



,(!) 



(l) 



-1/2 



(15) 



(l) 



ensures that Cov^ 1 ), Z®) = 0, and = ^(E,(D> ^ + E x w <0 ^) and p are 

chosen so that the sample mean and variance of Z& are equal to zero and one, respectively. 
We assume without loss of generality that k > 1. 
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Note that Z depends only on the first column of X. Moreover, each point Z( is just 
a linear transformation of scj . In the case of neighborhoods Zi, the situation can be dif- 
ferent. If the first column of Xj is either non-negative or non-positive, then Z{ is indeed 
a linear transformation of X^\ However, if is located on both sides of zero, Z{ is 
not a linear transformation of X^\ Denote by No the set of indices i of neighborhoods 
Zi that are not linear transformations of X^\ The number | iVo | depends on the num- 
ber of nearest neighbors K. Recall that for each neighborhood, we defined the radius 
r{i) = maxj ^, 6 |o v ...^} \\xi,j — a^H- Define r max = maxj g 7v r(i) to be the maximum radius 
of neighborhoods i, such that i £ Nq. 



5.2 The embeddings for LEM and DFM 

So far we have claimed that given the original sample X, we expect the output to resemble 
Y (see Eq. 13). However, Y does not satisfy the normalization constraints of Eq. [5] for the 
cases of LEM and DFM. Define Y to be the only affine transformation of X (up to rotation) 
that satisfies the normalization constraint of LEM and DFM. When the original sample is 
given by X, we expect the output of LEM and DFM to resemble Y. We note that unlike 
the matrix Y that was defined in terms of the matrix X only, Y depends also on the choice 
of neighborhoods through the matrix D that appears in the normalization constraints. 

We define Y more formally. Denote X = X — 11' DX. Note that X is just a 
translation of X that ensures that X'Dl = 0. The matrix X' DX is positive definite and 
therefore can be presented by TT,T' where T is a 2 x 2 orthogonal matrix and 



a 






i 2 



where a > r. Define X = XT; then Y = XT, 1 / 2 is the only affine transformation of X 
that satisfies the normalization constraints of LEM and DFM; namely, we have Y'DY = I 
and Y'Dl = 0. 

We define Z similarly to Eq. 



14 



,w 



< o 



xf ] > 



where k is defined by Eq 

and p 2 = k 2 - (i) - ~ d 
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with respect to X, z^> = jt(Y2 (i 



7^ + £*«<o 



(i) 



'x\ ll >o~" \ x i ) +Tl S: m< dii{x i j. 
A similar analysis to that of Y and Z can be performed for Y and Z. The same necessary 
conditions for success are obtained, with a, r, and p replaced by a, f , and p, respectively. 
In the case where the distribution of the original points is uniform, the ratio ? is close to 
the ratio ^ and thus the necessary conditions for the success of LEM and DFM are similar 



to the conditions in Corollary 5.2 
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5.3 Characterization of the embeddings 



The main result of this section provides necessary conditions for the success of the normalized- 
output algorithms. Following Section [3j we say that the algorithms fail if $(Y) > &(Z), 
where Y and Z are defined in Eqs.[l3]and 14 respectively. Thus, a necessary condition for 



the success of the normalized-output algorithms is that <&(Y) < &{Z). 

Theorem 5.1 Let X be a sample from a two-dimensional domain and let ip{X) be its 
embedding in high- dimensional space. Let Y and Z be defined as above. Then 



+ 



N 



-Car- 



et ' max 



,-(2) 



< 



$(y) > $(z) 



where c a is a constant that depends on the specific algorithm. For the algorithms LEM and 
DFM a more restrictive condition can be defined: 



< 



$(Y) > $(Z) 



For the proof, see Appendix A. 6 



Adding some assumptions, we can obtain a simpler criterion. First note that, in general, 
and e^ should be of the same order, since it can be assumed that, locally, the neighbor- 



hoods are uniformly distributed. Second, following Lemma A. 2 (see Appendix A.8), when 
is a sample from a symmetric unimodal distribution it can be assumed that k ~ 1 and 
p 2 > Then we have the following corollary: 



Corollary 5.2 Let X,Y,Z be as in Theorem 5.1 Let c = cj/t be the ratio between the 
variance of the first and second columns of X. Assume that < \f2f^ 2 \ k < \/2, and 
p 2 > Then 



4 1 + 



I Wo I Car. 



2 

max 



N V2e( 2 ) 
For LEM and DFM, we can write 



< c 



$(Y) > $(Z) 



4 < *(Y) > *(Z) . 



We emphasize that both Theorem 5.1 and Corollary 5.2 do not state that Z is the output 



of the normalized-output algorithms. However, when the difference between the right side 
and the left side of the inequalities is large, one cannot expect the output to resemble the 



original sample (see Lemma 3.1). In these cases we say that the algorithms fail to recover 



the structure of the original domain. 



5.4 Generalization of the results to manifolds of higher dimensions 

The discussion above introduced necessary conditions for the normalized-output algorithms' 
success on two-dimensional manifolds embedded in MP . Necessary conditions for success on 
general <i-dimensional manifolds, d > 3, can also be obtained. We present here a simple cri- 
terion to demonstrate the fact that there are ci-dimensional manifolds that the normalized- 
output algorithms cannot recover. 
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Let X = [X^\ . . . , X^] be a N x d sample from a (i-dimensional domain. Assume that 
the input for the normalized-output algorithms is given by ip(X) C MP where tp : R d -> R v 
is a smooth function and D > d is the dimension of the input. We assume without loss of 
generality that X'l = and that Cov(X) is a diagonal matrix. Let Y = XCov(X)^ 1 ^ 2 . 
We define the matrix Z = [Z^\ . . . , Z^] as follows. The first column of Z, Z^\ equals 
the first column of Y, namely, = Y^\ We define the second column Z^ similarly to 



the definition in Eq. 14 



Z 



(2) 



-CD 



Z& x^KO 



— 1 z^) x) ' > 



where k is defined as in Eq. 15 and and p are chosen so that the sample mean and 
variance Of Z( 2 ) are equal to zero and one, respectively. We define the next d — 2 columns 
of Z by 

^> = rM -^ ZW ; , = 3,...,,, 

i - 4 

where a2j = Z^'Y^i, Note that Z'l = and Cov(Z) = /. Denote a max = max je / 3j m a%j. 
We bound &{Z) from above: 



N 



1 



i=l V ^i/ 3=3 

iV d 



(2) 



< $(rW) + $(z< 2 ') + 



l 



U) 



1 - (J 2 

max i=l j=3 



+ 



cr 



AT d 



EE M 



(2) 



1 - (J 2 

max i=l 3=3 



1 — C 1 — <7 — ' 

"max "max » = 3 



Since we may write &(Y) = Y2j=i &(Y^), we have 



1 + {d f^ $(z' 2 ') < <D(y(2)) + 



1 - a 2 v y v 7 1 -cr 2 



5^$(yC?)) ^ $(z) < $(y). 

" lax j=3 



When the sample is taken from a symmetric distribution with respect to the axes, one 
can expect c max to be small. In the specific case of the d-dimensional grid, <r max = 0. 
Indeed, Y^> is symmetric around zero, and all values of Z^ appear for a given value of 
yC?). Hence, both LEM and DFM are expected to fail whenever the ratio between the 
length of the grid in the first and second coordinates is slightly greater than 2 or more, 
regardless of the length of grid in the other coordinates, similar to the result presented in 



Theorem 4.1 Corresponding results for the other normalized-output algorithms can also 



be obtained, similar to the derivation of Corollary 5.2 
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60/-45 60/-41.9 60/-38.79 60/-35.69 60/-32.59 60/-29.48 



Iv^l E^] tl*^ 




56.9/-45 5B.9/-41.9 56.9/-38.79 56.9/-35.69 56.9/-32.59 5B.9/-29.48 

PI PI PI PI PI P> 

53.79A45 53.79/-41.9 53.79/-3B.79 53.79/-35.69 53.79Z-32.59 53.79A29.48 



PI S^j P^ p^ 



50.69/-45 50.69/-41.9 50.69/-38.79 50.69/-35.69 50.G9/-32.59 50.69/-29.48 



PI PI PI PI PI P"* 



Figure 6: The data sets for the first example appear in panel A. In the left appears the 
1600-point original swissroll and in the right appears the same swissroll, after its 
first dimension was stretched by a factor of 3. The data for the second example 
appear in panel B. In the left appears a 2400-point uniform sample from the 
"fishbowl" , and in the right appears the same "fishbowl" , after its first dimension 
was stretched by a factor of 4. In panel C appears the upper left corner of the 
array of 100 x 100 pixel images of the globe. Above each image we write the 
elevation and azimuth. 
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5.5 Numerical results 



We ran all five normalized-output algorithms, along with Isomap, on three data sets. We 



used the Matlab implementations written by the algorithms' authors as provided by Wittman 



(retrieved Jan. 20071 



The first data set is a 1600-point sample from the swissroll as obtained from Wittman 



(retrieved Jan. 2007 1 . The results for the swissroll are given in Fig. [7) Al-Fl. The results 
for the same swissroll, after its first dimension was stretched by a factor 3, are given in 
Fig. [7J A2-F2. The original and stretched swissrolls are presented in Fig. [6]A The results 
for K = 8 are given in Fi g. [7} W e also checked for K = 12, 16; but "short-circuits" occur (see 
Balasubramanian et al. 2002 for a definition and discussion of "short-circuits"). 





F1 



C2 



□2 



J V\J 



F2 



Figure 7: The output of LEM on 1600 points sampled from a swissroll is presented in Al. 

The output of LEM on the same swissroll after stretching its first dimension by a 
factor of 3 is presented in A2. Similarly, the outputs of DFM, LLE, LTSA, HLLE, 
and Isomap are presented in Bl-2, Cl-2, Dl-2, El-2, and Fl-2, respectively. We 
used K = 8 for all algorithms except DFM, where we used a = 2. 



The second data set consists of 2400 points, uniformly sampled from a "fishbowl" , where 
a "fishbowl" is a two-dimensional sphere minus a neighborhood of the northern pole (see 
Fig. [6f3 for both the "fishbowl" and its stretched version). The results for K = 8 are given 
in Fig. [8j We also checked for K = 12, 16; the results are roughly similar. Note that the 
"fishbowl" is a two-dimensional manifold embedded in M 3 , which is not an isometry. While 
our theoretical results were proved under the assumption of isometry, this example shows 
that the normalized-output algorithms prefer to collapse their output even in more general 
settings. 
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Figure 8: The output of LEM on 2400 points sampled from a "fishbowl" is presented in Al. 

The output of LEM on the same "fishbowl" after stretching its first dimension by a 
factor of 4 is presented in A2. Similarly, the outputs of DFM, LLE, LTSA, HLLE, 
and Isomap are presented in Bl-2, Cl-2, Dl-2, El-2, and Fl-2, respectively. We 
used K = 8 for all algorithms except DFM, where we used a = 2. 



The third data set consists of 900 images of the globe, each of 100 x 100 pixels (see 
Fig. |6p). The images, provided by Hamm et al. (20051, were obtained by changing the 



globe's azimuthal and elevation angles. The parameters of the variations are given by 
a 30 x 30 array that contains —45 to 45 degrees of azimuth and —30 to 60 degrees of 
elevation. We checked the algorithms both on the entire set of images and on a strip of 
30 x 10 angular variations. The results for K = 8 are given in Fig. [9j We also checked for 
K = 12, 16; the results are roughly similar. 

These three examples, in addition to the noisy version of the two-dimensional strip dis- 
cussed in Section [3] (see Fig.JT]), clearly demonstrate that when the aspect ratio is sufficiently 
large, all the normalized-output algorithms prefer to collapse their output. 



6. Asymptotics 

In the previous sections we analyzed the phenomenon of global distortion obtained by the 
normalized-output algorithms on a finite input sample. However, it is of great importance 
to explore the limit behavior of the algorithms, i.e., the behavior when the number of input 
points tends to infinity. We consider the question of convergence in the case of input that 
consists of a (i-dimensional manifold embedded in MP , where the manifold is isometric to a 
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Figure 9: The output of LEM on the 30 x 30 array of the globe rotation images is presented 
in Al; the output of LEM on the array of 30 x 10 is presented in A2. Similarly, 
the outputs of DFM, LLE, LTSA, HLLE, and Isomap are presented in Bl-2, 
Cl-2, Dl-2, El-2, and Fl-2 respectively. We used K = 8 for all algorithms 
except DFM, where we chose a to be the root of the average distance between 
neighboring points. 



convex subset of Euclidean space. By convergence we mean recovering the original subset 
of M. d up to a non-singular affine transformation. 



Some previous theoretical works presented results related to the convergence issue. Huo 



and Smith (2006) proved convergence of LTSA under some conditions. However, to the 



best of our knowledge, no proof or contradiction of convergence has been given for any 
other of the normalized-output algorithms. In this section we discuss the various algo- 
rithms separately. We also discuss the influence of noise on the convergence. Using the 
results from previous sections, we show that there are classes of manifolds on which the 
normalized-output algorithms cannot be expected to recover the original sample, not even 
asymptotically. 



6.1 LEM and DFM 

Let x\,X2, ... be a uniform sample from the two-dimensional strip S = [0,L] x [0, 1]. Let 
X n = [x%, . . . ,x n ]' be the sample of size n. Let K = K{n) be the number of nearest 
neighbors. Then when K <C n there exists with probability one an N, such that for all 
n > N the assumptions of Corollary |5.2| hold. Thus, if L > 4 we do not expect either LEM 
or DFM to recover the structure of the strip as the number of points in the sample tends 
to infinity. Note that this result does not depend on the number of neighbors or the width 
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of the kernel, which can be changed as a function of the number of points n, as long as 
K <C n. Hence, we conclude that LEM and DFM generally do not converge, regardless of 
the choice of parameters. 

In the rest of this subsection we present further explanations regarding the failure of 



LEM and DFM based on the asymptotic behavior of the graph Laplacian (see Belkin and 



Niyogi 2003 for details). Although it was not mentioned explicitly in this paper, it is well 
known that the outputs of LEM and DFM are highly related to the lower non-negative 



eigenvectors of the normalized graph Laplacian matrix (see Appendix A.l). It was shown 



by Belkin and Niyogi ( 2005 1 , Hein et al. ( 2005 ) , and Singer ( |2006 1 that the graph Laplacian 
operator converges to the continuous Laplacian operator. Thus, taking a close look at the 
eigenfunctions of the continuous Laplacian operator may reveal some additional insight into 
the behavior of both LEM and DFM. 

Our working example is the two-dimensional strip S = [0,L] x [0, 1], which can be con- 
sidered as the continuous counterpart of the grid X defined in Section [4} Following Coifman 
and Lafon (20061 we impose the Neumann boundary condition (see details therein). The 
eigenfunctions fij(xi, X2) and eigenvalues Ajj on the strip S under these conditions are 
given by 



<Pi,j{x 1 ,X2) 



COS 



T 



X\ I COS (j7TX2) A 



'■J 



+ (jvr) 2 for i,j = 0,1,2, 



When the aspect ratio of the strip satisfies L > M E N, the first M non-trivial eigenfunctions 
are ifiOi i = 1, . • ■ ,M, which are functions only of the first variable x\. Any embedding 
of the strip based on the first M eigenfunctions is therefore a function of only the first 
variable x\. Specifically, whenever L > 2 the two-dimensional embedding is a function of 
the first variable only, and therefore clearly cannot establish a faithful embedding of the 
strip. Note that here we have obtained the same ratio constant L > 2 computed for the 
grid (see Section [4] and Figs. |4]and|5]) and not the looser constant L > 4 that was obtained 
in Corollary |5.2| for general manifolds. 



6.2 LLE, LTSA and HLLE 



As mentioned in the beginning of this section, Huo and Smith (2006) proved the convergence 



of the LTSA algorithm. The authors of HLLE proved that the continuous manifold can be 



recovered by finding the null space of the continuous Hessian operator (see Donoho and 



Grimes, 2004 Corollary). However, this is not a proof that the algorithm HLLE converges. 
In the sequel, we try to understand the relation between Corollary |5.2| and the convergence 
proof of LTSA. 

Let X\,X2,-- - be a sample from a compact and convex domain Q in M 2 . Let X n = 
[x\, . . . , x n ]' be the sample of size n. Let tp be an isometric mapping from M? to MP, where 
T> > 2. Let ip(X n ) be the input for the algorithms. We assume that there is an N such 



that for all n > N the assumptions of Corollary 5.2 hold. This assumption is reasonable, 



for example, in the case of a uniform sample from the strip S. In this case Corollary 5.2 
states that $>(Z n ) < <&(Y n ) whenever 



4 1 + 



n 



V2ei 2) 



< On 
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where c n is the ratio between the variance of Xn and X^ 2 ' assumed to converge to a 
constant c. The expression is the fraction of neighborhoods Xi jH such that X^J is 
located on both sides of zero. r maX) „ is the maximum radius of neighborhood in no- Note 
that we expect both and r maXira to be bounded whenever the radius of the neighborhoods 

I I (2) 

does not increase. Thus, Corollary 5.2 tells us that if {e„ } is bounded from below, we 



cannot expect convergence from LLE, LTSA or HLLE when c is large enough. 

The consequence of this discussion is that a necessary condition for the convergen ce o * 

(2) I 

LLE, LTSA and HLLE is that {e^ } (and hence, from the assumptions of Corollary 5.2 
also {en^}) converges to zero. If the two-dimensional manifold ip(Q) is not contained in 
a linear two-dimensional subspace of MP, the mean error en is typically not zero due 
to curvature. However, if the radii of the neighborhoods tend to zero while the number of 
points in each neighborhood tends to infinity, we expect e n — ► for both LTSA and HLLE. 
This is because the neighborhood matrices W{ are based on the linear approximation of the 
neighborhood as captured by the neighborhood SVD. When the radius of the neighborhood 
tends to zero, this approximation gets better and hence the error tends to zero. The same 
reasoning works for LLE, although the use of regularization in the second step of LLE may 
prevent from converging to zero (see Section [2]) . 

We conclude that a necessary condition for convergence is that the radii of the neighbor- 
hoods tend to zero. In the presence of noise, this requirement cannot be fulfilled. Assume 
that each input point is of the form ip(xi) + s% where e% £ MP is a random error that is 
independent of Ej for j ^ i. We may assume that £j ~ iV(0,a 2 /), where a is a small 
constant. If the radius of the neighborhood is smaller than a, the neighborhood cannot be 
approximated reasonably by a two-dimensional projection. Hence, in the presence of noise 
of a constant magnitude, the radii of the neighborhoods cannot tend to zero. In that case, 
LLE, LTSA and HLLE might not converge, depending on the ratio c. This observation 
seems to be known also to |Huo and Smith who wrote: 



"... we assume a = o(r); i.e., we have " — > 0, as r — ► 0. 

It is reasonable to require that the error bound (a) be smaller than the size of 
the neighborhood (r), which is reflected in the above condition. Notice that this 
condition is also somewhat nonstandard, since the magnitude of the errors is 
assumed to depend on n, but it seems to be necessary to ensure the consistency 
of LTSA."0 

Summarizing, convergence may be expected when n — ► 00, if no noise is introduced. If 
noise is introduced and if ct/t is large enough (depending on the level of noise a), convergence 
cannot be expected (see Fig. [TJ. 

7. Concluding remarks 

In the introduction to this paper we posed the following question: Do the normalized-output 
algorithms succeed in revealing the underlying low-dimensional structure of manifolds em- 
bedded in high-dimensional spaces? More specifically, does the output of the normalized- 
output algorithms resemble the original sample up to affine transformation? 



2. We replaced the original r and a with r and a respectively to avoid confusion with previous notations. 
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The answer, in general, is no. As we have seen, Theorem 5.1 and Corollary |5 . 2 1 show that 
there are simple low-dimensional manifolds, isometrically embedded in high-dimensional 
spaces, for which the normalized-output algorithms fail to find the appropriate output. 
Moreover, the discussion in Section [6] shows that when noise is introduced, even of small 
magnitude, this result holds asymptotically for all the normalized-output algorithms. We 
have demonstrated these results numerically for four different examples: the swissroll, the 
noisy strip, the (non- isometrically embedded) "fishbowl" , and a real- world data set of globe 
images. Thus, we conclude that the use of the normalized-output algorithms on arbitrary 
data can be problematic. 

The main challenge raised by this paper is the need to develop manifold-learning algo- 
rithms that have low computational demands, are robust against noise, and have theoreti- 
cal convergence guarantees. Existing algorithms are only partially successful: normalized- 
output algorithms are efficient, but are not guaranteed to converge, while Isomap is guar- 
anteed to converge, but is computationally expensive. A possible way to achieve all of the 
goals simultaneously is to improve the existing normalized-output algorithms. While it is 
clear that, due to the normalization constraints, one cannot hope for geodesic distances 
preservation nor for neighborhoods structure preservation, success as measured by other 



criteria may be achieved. A suggestion of improvement for LEM appears in Gerber et al. 



(2007), yet this improvement is both computationally expensive and lacks a rigorous con- 
sistency proof. We hope that future research finds additional ways to improve the existing 
methods, given the improved understanding of the underlying problems detailed in this 
paper. 
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Appendix A. Detailed proofs and discussions 

A.l The equivalence of the algorithms' representations 

For LEM, note that according to our representation, one needs to minimize 

N N K 

$(y) = y, mm = x;z>ijiifi - yi,j\ 



m,j\\yi - yij\ l2 

i=l i=l j=l 



under the constraints Y'Dl = and Y'DY = I. Define w rs = w r j if s is the j-th neighbor 
of r and zero otherwise. Define D to be the diagonal matrix such that d rr = Y2a=l ^ rs ' n °t e 
that D = D. Using these definitions, one needs to minimize &(Y) = ^ r s w rs \\y r — Us\\ 2 

under the constraints Y'Dl = and Y'DY = I, which is the the authors' representation 
of the algorithm. 
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For DFM, as for LEM, we define the weights w rs . Define the N x N matrix W = (w rs ). 
Define the matrix D~ l W, note that this matrix is a Markovian matrix and that v (°) = 1 is 
its eigenvector corresponding to eigenvalue 1, which is the largest eigenvalue of the matrix. 
Let v( p \ p = l,...,d be the next d eigenvectors, corresponding to the next d largest 
eigenvalues X p , normalized such that v^'Dv^ = 1. Note that the vectors v^°\ . . . , are 
also the eigenvectors of / — D~ 1 W corresponding to the d+ 1 lowest eigenvalues. Thus, the 
matrix [v^\ . . . ,v^] (up to rotation) can be computed by minimizing tr (Y'(D — W)Y) 
under the constraints Y'DY = I and Y'Dl = 0. Simple computation shows (see Belkin 
Eq. 3.1) that tr (Y'(D - W)Y) = \Y, r ,s™rs\\yr ~ ^P- We already 
= /Cr s'&rsWyr — Us\\ 2 - Hence, minimizing tr (Y'(D — W)Y) under the 
= I and Y'Dl = is equivalent to minimizing <&(Y) under the same 



and Niyogi 



2003 



showed that $(Y) 
constraints Y'DY 

constraints. The embedding suggested by Coifman and Lafon (20061 (up to rotation) is 

1,(1) . v (d) 



the matrix 



Ai 



Note that this embedding can be obtained from the 



Ui)| > • • • > "a 

output matrix Y by a simple linear transformation. 

For LLE, note that according to our representation, one needs to minimize 



JV 



N 



$(Y) = J2 \\WiYi\\ 2 F = J2\W> 



i=l 



K 
3=1 



under the constraints Y'l = and Cov(Y) = /, which is the minimization problem given 



by Roweis and Saul (2000). 



The representation of LTSA is similar to the representation that appears in the original 



paper, differing only in the weights' definition. We defined the weights Wi following Huo 



and Smith (2006), who showed that both definitions are equivalent. 



For HLLE, note that according to our representation, one needs to minimize 



N 



N 



SKY) = ^ \\WiYifp = (Y-HlHiYi) 



i=i 



i=l 



under the constraint Cov(Y) = /. This is equivalent (up to a multiplication by y/(N)) to 
minimizing tr (Y'TCY) under the constraint Y'Y = I, where Ti is the matrix that appears in 
the original definition of the algorithm. This minimization can be calculated by finding the 



d + 1 lowest eigenvectors of TC, which is the embedding suggested by Donoho and Grimes 
(2001. 
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A. 2 Proof of Lemma 13.11 

We begin by estimating &(Y). 

N N K 

= Y,^ Yi+£WiE ^F = Y,Y,\\ Wi y^ +£Wie ^ 2 ( 16 ) 

i=l i=l j=0 

N K 

^ EE (\\WiVv\\ 2 -2e\(WiVij)'Wieij\) 

i=l j=0 
N K 

> EE " 4£ ) \\WmJ 2 -te\\WieiJ 2 

i=l j=0 

N N 

i2 

i=l i=l 
N 

i2 

1 1 F ' 

i=l 



= (l-4e) j^||Wiyi||^-4£ 

i=i i=i 

iV 

> (l-4e)*(y)-4e^||W i ||^||^ i 



where ej denotes the j-th row of E^. 

1 1 2 

We bound ||Wi||^ for each of the algorithms by a constant C a . It can be shown that for 
LEM and DFM, C a < 2K; for LTSA, C a < K; for HLLE C a < For LLE in the case 



of positive weights Wi we have C a < 2. Thus, substituting C a in Eq. 16 we obtain 

N K 

*<X) > (l-4e)cI>(y)-4eC a ^^||e ij || 2 

i=\ j=0 

> (1 - 4c)*(y) - 4eC a S = (1 - 4e)*(y) - 4eC a( S . 

The last inequality holds true since S is the maximum number of neighborhoods to which 
a single observation belongs. 



A. 3 Proof of Eq. © 

By definition a 2 = Var(XW) and hence 

m q 



°- = jf E E (*8> 

i=—m j=~q 



D(2o + i) E E * 



2 



m <? 

2 



(2m + l)(2g + 

' i=—m3=—q 

m 

Z x " -2 



2m + _ 

i=i 



tE 



2 (2m + l)(m + l)m 
2m + 1 6 
(m + l)m 
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The computation for r is similar. 

A. 4 Estimation of F(K) and F(K) for a ball of radius r 

Calculation of (f>(Y{j) for general K can be different for different choices of neighborhoods. 
Therefore, we restrict ourselves to estimating <j)(Yij) when the neighbors are all the points 
inside an r-ball in the original grid. Recall that (j>(Yij) for an inner point is equal to the 
sum of the squared distance between yij and its neighbors. The function 



agrees with the squared distance for points on the grid, where x\ and x 2 indicate the 
horizontal and vertical distances from Xij in the original grid, respectively. We estimate 
4>(Yij) using integration of f(xi,x 2 ) on B(r), a ball of radius r, which yields 

f 7rr 4 / 1 1 \ 

<f>( Y W ~ / f{xi,x 2 )dx 1 dx 2 = — I + ^2 ) • ( 17 ) 

Of+z|)<r 2 

Thus, we obtain F(K) ps 

We estimate (j>(Zij) similarly. We define the continuous version of the squared distance 
in the case of the embedding Z by 

2 ( 1 4 

g{x 1 ,x 2 ) = x{ ^ + -g 

Integration yields 

<f>(Zij) & J g(xi,x 2 )dxidx2 = ( ^ + ^ ) ( l S) 

(a;f+x|)<r 2 



Hence, we obtain F(K) ~ and the relations between Eqs. |8]and 10 are preserved for a 
ball of general radius. 

For DFM, a general rotation-invariant kernel was considered for the weights. As with 



Eqs. 17 and 18 the approximations of 4>(Yij) and (f>(Zij) for the general case with neigh- 



borhood radius r are given by 




f(xi, x 2 )k(xi, x 2 )dx\dx 2 = I 7r / k(t 2 )t 3 dt 

(xl+x 2 2 )<r 2 V 0<t<r 

and 

{x\+xl)<r 2 V 0<t<r 

Note that the ratio between these approximations of <j>(Yij) and (f>(Zij) is preserved. In 
light of these computations it seems that for the general case of rotation-invariant kernels, 
4>{Yij) > (ft(Zij) for aspect ratio sufficiently greater than 2. 
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A.5 Proof of Eq. [TT 



Direct computation shows that 



0_ gj+ig E(2i 



2m(m + 1) 
(2m + l)p 



Recall that by definition p ensures that Var(Z( 2 )) = 1. Hence, 



1 m 1 



i=—mj=—q 



2 4m(m + l)(2m + 1) 4m 2 (m + 1) 



2m + 1 6p 2 (2m + l) 2 p 2 

4m(m+l) 4ra 2 (m + l) 2 



3p 2 (2m + l) 2 p 2 ' 



Further computation shows that 



4(m + l) 2 m 2 
(2m + l) 2 ' 



(m + l)m > 



Hence, 



2 4(m + l)m 



P > 



(m + l)m = o 1 . 



A. 6 Proof of Theorem 15. II 

The proof consists of computing <&(Y) and bounding &(Z) from above. We start by com- 
puting $(F). 



A 



A 



A? 



'(2) 



i=l i=l 

eW e( 2 ) 



i=l 



iV^r + iV- 



(7 



T 



2 • 



The computation of &(Z) is more delicate because it involves neighborhoods Z% that are 
not linear transformations of their original sample counterparts. 



A 



A 



$(z) = y, \\ w i z i\\ 2 F = J2 II WiZ i 1] ° + Y,\\ WiZ > 



A' 



,(2) 



i=l 

r e(D 

(7 



1=1 



i=l 



< N^+N 



+ E 



(2) 



ieN 



(20) 
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Note that maxj fcg r : _ t x} — < nr{i)/ p. Hence, using Lemma A.l we get 



WiZ. 



(2) 



c a K 2 r (i) 2 
P 2 



< 



(21) 



where c a is a constant that depends on the specific algorithm. Combining Eqs. 20 and 21 
we obtain 

$(Z) < N— + N^^- + \N \c a r 2 max ^ . 

a z p A p z 



In the specific case of LEM and DFM, a tighter bound can be obtained for 
that for LEM and DFM 

2 



WiZ 



(2) 



. Note 



WiZ 



(2) 



j= lK 



J 2 ) _ J 2 )\2 



.k 2 ( 2 ) j2), 2 _ K2 m 



Eft, , 



2 e * • 



Combining Eq. 19 and the last inequality we obtain in this case that 

, . e« «2 g (i) 
a z p z 

which completes the proof. 
A. 7 Lemma lA.ll 

Lemma A.l Let Xi = [xi, xn, . . . , Xi k]' be a local neighborhood. Let r\ 

— mELXj ^ 1 1 x^ j x 2 /jj 1 1 • 

Then 

\\WiXiWl < c a r 2 , 

where c a is a constant that depends on the algorithm. 

Proof We prove this lemma for each of the different algorithms separately. 
• LEM and DFM: 



\WiXiWjr 



K 

E 

i=i 



K 



2 

i ' 



where the last inequality holds since Wij < 1. Hence c a = K. 
LLE: 

K 2 



\WiXiYp 



3=1 



< 



1 K 

72 E W Xi >3 



K 2 



3=1 



< 



x-\\ 2 < T ± 



i * 

*E< 

3=1 
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where the first inequality holds since Wij were chosen to minimize "^i,j{ x 



Hence c a = 1/K. 
• LTSA: 



\WiXif F 



< 



- PiPbHXtWi <\\(i- PtPbwi \\HXiWi 

K^2\\x, 



- I|2 ^ tj-2 2 

Xi\\ <Kr\. 



The first equality is just the definition of Wi (see Sec. |2j). The matrix / — PiP[ is a 
projection matrix and its square norm is the dimension of its range, which is smaller 
than K + 1. Hence c a = K 2 . 



HLLE: 



\WiXi\\ 2 F = \\WiHXi\\ 2 F < \\Wi\\ F \\HXi\\ 2 F < d(d + 1} (K + l)r, 



The first equality holds since WiH = Wi(I — 7? 11') = Wi, since the rows of Wi are 



orthogonal to the vector 1 by definition (see Sec. 



2 ) . Hence c„ 



d(d+l) 



(K + l) 



A. 8 Lemma IA.2I 

Lemma A. 2 Let X be a random variable symmetric around zero with unimodal distribu- 

i 



tion. Assume that Var(X) = a 2 . Then Var(\X\) > 



Proof First note that that the equality holds for X ~ U(—y/3a,y/3o~), where U denotes 
the uniform distribution. Assume by contradiction that there is a random variable X, 

2 

symmetric around zero and with unimodal distribution such that Var(| X\) < \ — £, where 
e > 0. Since Var(|A|) = E{\X\ 2 ) - E{\X\) 2 , and E{\X\ 2 ) = E{X 2 ) = Var(X) = a 2 , we 
have£(|X|) 2 > ^+e. 

We approximate X by X n , where X n is a mixture of uniform random variables, defined as 

follows. Define X n ~ YA=iPi U (~ c i> c i) where Pi > °> Et=iP? = L Note that E ( X n) = 
and that Var(X n ) = Ya=1 P? ( c f) 2 /3- For l ar g e enough n, we can choose pf and c™ such 
that Var(A„) = a 2 and E(\X - X n \) < 2 e{\x\) • 

Consider the random variable \X n \. Note that using the definitions above we may write 
\X n \ = ^i=iPiU(p,c™), hence E(\X n \) = | Yli=\ Pi" c ?- We bound this expression from 
below. We have 

E{\X n \) 2 = E{\X n -X + X\f > (E{\X\) -E{\X n -X\)) 2 (22) 
> E{\X\) 2 -2E{\X\)E(\X n -X\) > 



Let X n . x = EtiP^Ui-cr 1 ,^- 1 ) where 



pf i < n — 1 

Pn-1 +Pn i = U-l 



30 



The Price of Normalization 



and 

f cf i < n — 1 

„n— 1 



Note that Var(X n _i) = a 2 by construction and -Y n _i is symmetric around zero with uni- 
modal distribution. Using the triangle inequality we obtain 

^ n— 1 1 ?i 

E(\Xn-l\) = - Y.P^^' 1 > 2 E^ c " = E W) ■ 

i=l i=l 



Using the same argument recursively, we obtain that > E'dAnI). However, X\ 

Since by Eq. El E(\X n \) 2 > 3f 



U(-V3a,V3a) and hence ^(|Xi|) 2 = Since by Eq. |22l S(|X n |) 2 > ^ we have a 



contradiction. 
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